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Abstract 

We consider the segmentation problem of Poisson and negative binomial (i.e. 
overdispersed Poisson) rate distributions. In segmentation, an important issue re- 
mains the choice of the number of segments. To this end, we propose a penalized log- 
likelihood estimator where the penalty function is constructed in a non-asymptotic 
context following the works of L. Birge and P. Massart. The resulting estimator is 
proved to satisfy an oracle inequality. The performances of our criterion is assessed 
using simulated and real datasets in the RNA-seq data analysis context. 

Mathematics subject classification 2010: primary 62G05, 62G07; secondary 62P10 
Keywords and phrases: Density estimation; Change-points detection; Count data (RNA- 
seq); Poisson and negative binomial distributions; Model selection. 

Introduction 

We consider a multiple change-point detection setting for count datasets, which can be 
written as follows: we observe a finite sequence {yt}t&{i,...,n} realisation of independent 
variables Yj. These variables are supposed to be drawn from a probability distribution Q 
which depends on a set of parameters. Here two types of parameters are distinguished: 

Yt^g{eu(t))=s{t), 1 <t <n, 

where is a constant parameter while the ^s are point-specific. In many contexts, we 
might want to consider that the ^^s are piece-wise constant and so subject to an unknown 
number i^T — 1 of abrupt changes (for instance with climatic or financial data). Thus, we 
want to assume the existence of partition of {1, . . . , n} into K segments within which the 
observations follow the same distribution and between which observations have different 
distributions, i.e. 9 is constant within a segment and differ from a segment to another. 



A motivating example is sequencing data analysis. For instance, the output of RNA- 
seq experiments is the number of reads (i.e. short portions of the genome) which first 
position maps to each location of a genome of reference. Supposing that we dispose of such 
a sequence, we expect to observe a stationarity in the amount of reads falling in different 
areas of the genome: expressed genes, intronic regions, etc. We wish to localize those 
regions that are biologically significant. In our context, we consider for Q the Poisson and 
negative binomial distributions, adapted to RNA-seq experiment analysis [1]. 
Change-point detection problems are not new and many methods have been proposed 
in the literature. For count data-sets, [2] provide a detailed bibliography of methods 
in the particular case of the segmentation of the DNA sequences that includes Bayesian 
approaches, scan statistics, likelihood-ratio tests, binary segmentation and numerous other 
methods such as penahzed contrast estimation procedures. In a Bayesian framework, [3] 
proposes to use an exact "ICL" criterion for the choice of K, while its approximation is 
computed in the constrained HMM approach of [4]. In this paper, we consider a penalized 
contrast estimation method which consists first, for every fixed in finding the best 
segmentation in K segments by minimizing the contrast over all the partitions with K 
segments, and then in selecting a convenient number of segments K by penalizing the 
contrast. Choosing the number of segments, i.e. choosing a "good" penalty, is a crucial 
issue and not so easy. The most basic examples of penalty are the Akaike Information 
Criterion (AIC [5]) and the Bayes Information Criterion (BIC [6]) but these criteria are 
not well adapted in the segmentation context and tend to overestimate the number of 
change-points (see [7, 8] for theoretical explanations). In this particular context, some 
modified versions of these criteria have been proposed. For instance, [8, 9] have proposed 
modified versions of the BIC criterion (shown to be consistent) in the segmentation of 
Gaussian processes and DNA sequences respectively. However, these criteria are based 
on asymptotic considerations. In the last years there has been an extensive literature 
infiuenced by [10, 11] introducing non-asymptotic model selection procedures, in the sense 
that the size of the models as well as the size of the hst of models are allowed to be large 
when n is large. This penalized contrast procedure consists in selecting a model amongst 
a collection such that its performance is as close as possible to that of the best but 
unreachable model in terms of risk. This approach has been now considered in various 
function estimation contexts. In particular, [12] proposed a penalty for estimating the 
density of independent categorical variables in a least-squares framework, while [13, 14], 
or [15], focused on the estimation of the density of a Poisson process. 
When the number of models is large, as in the case of an exhaustive search in segmentation 
problem, it can be shown that penalties which only depend on the number of parameters 
of each model, as for the classical criteria, are theoretically (and also practically) not 
adapted. This was suggested by [16, 7] who show that the penalty term needs to be well 
defined, and in particular needs to depend on the complexity of the list of models, i.e. the 
number of models having the same dimension. For this reason, following the work of [10] 
and in particular [17] in the density estimation framework, we consider a penalized log- 
likelihood procedure to estimate the true distribution s of a Poisson or negative binomial- 
distributed sequence y. We prove that, up to a logn factor, the resulting estimator 
satisfies an oracle inequality. 

The paper is organized as follows. The general framework is described in Section 
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1. More precisely, we present our proposed penalized maximum- likelihood estimator, the 
form of the penalty and give some non- asymptotic risk bounds for the resulting estimator. 
The studies of the two considered models (Poisson and negative binomial) are done in 
parallel along the paper. Some exponential bounds are derived in Section 2. A simulation 
study is performed to compare our proposed criterion with others and an application to 
the segmentation of RNA-seq data illustrates the procedure in Section 3. The proof of 
the main result is given in Section 4 for which the proofs of some intermediate results are 
given in the Appendix 5. 

1 Model Selection Procedure 

1.1 Penalized maximum-likelihood estimator 

Let us denote by m a partition of [1, n], m = {|1, ri|, |ri, r2|, . . . , |rfc, n]} and by A4„ a 
set of partitions of [l, n]. In our framework we want to estimate the distribution s defined 
by s{t) — Q{9t, (f)), 1 < t < n, and we consider the two following models: 

GiOtA) =V{\t) (V) 

In the {.MB) case, we suppose that the over-dispersion parameter is known. We 
define the collection of models : 

Definition 1.1. The collection of models associated to partition m is Sm the set of 
distribution of sequences of length n such that for each element of Sm, for each segment 
J of TO, and for each t in J, Sm{t) = Q{9j, 0): 

Sm = {sm \ yj em, yt e J, Sm{t) = g{9j, (j))} . 

We shall denote by |m| the number of segments in partition to, and by | J| the length 
of segment J. 

We consider the log-likelihood contrast ^{u) — Ylt=i ~ logPu(yi), namely respectively 
for u{t) = V{lit) and u{t) = MB{qt,(t)), 

=Er=i/^t-^tiog(/^t) + iog(^tO, iv) 

= Er=i log <lt - Yt log(l - qt) - log . {MB) 

Then the minimal contrast estimator of s on the collection 5^ is 

Sm = arg min 7(u), (1) 

Yt 

so that, noting Yj = — j^j — , for all J G to and t G J 

Sm{t) = V{Yj) for {V) and s^{t) - J\fB (^j^y 0^ ior {MB). (2) 
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Therefore, for each partition m of we can obtain the best estimator Sm as in 
equation (2), and thus define a collection of estimators {{sm,)meM„}- Ideally, we would 
wish to select the estimator Sm{s) amongst this collection with the minimum given risk. 
In the log-likelihood framework, it is natural to consider the Kullback-Lciblcr risk, with 
K{s,u) = E [j{u) — 7(.s)]. In the following we note E and P the expectation and the 
probability under the true distribution s respectively (otherwise the underlying distribu- 
tion is mentioned). In our models, the KuUback-Leibler between distributions s and u 
can be developed into 

K{s.u) =Et, (/'.-A.-A,log^), (P) 
Ki.u) -^E;Ulog(|)+i^log(i^). iMB) 

Unfortunately, minimizing this risk requires the knowledge of the true distribution 
s, and is unreachable. We will therefore want to consider the estimator Sm where m 
minimizes 7(sm) + pen{m) for a wcll-chosen function pen (depending on the data). By 
doing so, we hope to select an estimator s,-„ whose risk is as close as possible to the risk 
of Sm(s) = argmin„e;vt„ Es[K{s, s^)] in the sense that 

B[K(s,s,n)]<CE[K(s,Smis))], 

where C is a nonnegative constant hopefully close to 1. We therefore introduce the fol- 
lowing definition: 

Definition 1.2. Let Ain be a collection of partitions of Il,n] constructed on a partition 
nif (i.e. m/ is a refinement of every m in A4n)- Given a nonnegative, increasing in the 
size of m penalty function pen: M.n R-t-; ^^^d choosing 

m = arg min {7(sm) +pen{m)}, 

meMn 

we define the penalized maximum-likelihood estimator as Srh- 

In the following Section we provide a choice of penalty function, and show that the 
resulting estimator satisfies an oracle inequality. 

1.2 Choice of the penalty function 

Main result 

The following result shows that for an appropriate choice of the penalty function, we have 
a non-asymptotic risk bound for the penalized maximum-likelihood estimator. 

Theorem 1.3. Let Ain be a collection of partitions constructed on a partition ruf such 
that there exist absolute positive constants Pmin, Pmax o-nd T satisfying: 

• Vi, Pmin <0t< Pmax and 
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• VJ e m/, |J| > r(log(n))^ 
Let {L^)meMn some family of positive weights satisfying 

S = ^ exp(— L^|m|) < +00. (3) 

m&Mn 

Let (3 > 1/2 in the Poisson case, /3 > 1/4 in the negative binomial case. If for every 
m e Mn 

pen{m) > p\m\ (1 + A^/L^y , (4) 

then 

E[/i^(s,s^)l < inf +pen(m)} + C(0,r,p^j„,p^„^,^,E), 

(16^)^/3 (4/3)1/3 

w^/i = /r,a\i/3 T "^^^^^ (^) = . . ou/3 T "^^^^^ {MB). 

(zpj — 1 V^Pj ' — 1 

We note h^{s,u) the squared Hellinger distance between distribution s and and 
is the projection of s onto the collection Sm according to the KuUback-Leibler distance. 
The proof of this Theorem is given in Section 4. 



Denoting = argmin„g5^ u), we have for J G m and t E J, 

(5) 



smit) =V{\j) where A^ = ^j^ {V) 



Sm{t) ^NB{pj,(t)) where pj = . {MB) 

We remark that the risk of the penalized estimator is treated in terms of Hellinger 
distance instead of the KuUback-Leibler information. This is due to the fact that the 
KuUback-Leibler is possibly infinite, and so difficult to control. It is possible to obtain a 
risk bound in term of KuUback-Leibler if we have a uniform control of 1 1 log(s/sj„) | |oo (see 
[18] for more explanation). 



Choice of the weights {Lm,m e Mn}- 

The penalty function depends on the family A^,„ through the choice of the weights L^ 
which satisfy (3). We consider for the set of all possible partitions of |1, nj constructed 
on a partition rnj which satisfies, for all segment J in m/, |J| > r(logn)^. Classically 
(see [19]) the weights are chosen as a function of the dimension of the model s, which is 
here |m|. The number of partitions of A4n having dimension D being bounded by (^), 
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we have 

n 

E = ^ e^-^lmi _ e-^^^Cardim e Mn, \m\ = D} 

meMn D=l 
D=l ^ ^ D=l 



" DlLo-l-log 



D=l 



So with the choice = 1 + K + log with k > Q, condition (3) is satisfied. Choosing, 
say K — 0.1, the penalty function can be chosen of the form 



pen(m) =/3|m| ( I + 4WI.I + log ( ) I , (6) 



where /3 is a constant to be caUbrated. 

Integrating this penalty in Theorem 1.3 leads to the following control: 

E[/i'(s,s^)] <Ci^mi sm) + P\m\ (^1 + 1.1 + log j | + C(0, T, E) (7) 

The following proposition gives a bound on the KuUback-Leibler risk associated to Sm'- 

Proposition 1.4. Letm be a partition of M.n, Sm be the minimum contrast estimator and 
Sm be the projection of s given by equations (2) and (5) respectively. Assume that there 
exists some positive absolute constants Pmin, Pmax o-nd V such that Vt, pmin < < Pmax 
and \J\ > r(logn)2. Then > 0, Va > 2 

K{s,Sm) hC2{e)\m\ < E[K{s,Sm)], 

1 1 - e 

where a < 1 is a constant that can be expressed according to n, C2{s) = y 

Poisson model (V) and C2{s) — p^j„T- rj in the negative binomial model (AfB). 

The proof is given in appendix 5.1. 

Combining proposition 1.4 and equation (7), we obtain the following oracle-type in- 
equality: 

Corollary 1.5. Let M.n be a collection of partitions constructed on a partition ruf such 
that there exist absolute positive constants Pmin, pmax o-nd V verifying: 
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• Vi, pmin < Ot < Pmax and 

• VJ e mf,\J\ > r(logn)2. 

There exists some absolute constant C such that 

E[/i'(s,s^)] < Clog(n) inf {E[X(s,s^)]} + C(0,r 

) Pmim Pmax J 

meMn 

2 Exponential bounds 

In order to prove Theorem 1.3, the general procedure in this model selection framework 
(see for example [19]) is the following: by definitions of m and (see definition 1.2 and 
equation (1)), we have, Vm e A4n 

l{srh) +pen{rh) < 7(5^) + pen{m) < 7(3^) +pen{m). 

Then, with j{u) — j{u) — E[j{u)], 

K{s,Srh) <K{s,Sm) + l{sm) " 7(sm) " pen{m) + pen{m) . 

The idea is therefore to control 7(5^) — 7(sm') uniformly over m' e Ain- This is more 
complicated when dealing with different models m and m'. Thus, following the work of [17] 
(see proof of Theorem 3.2, also recalled in [18]), we propose the following decomposition 

7(Sm) - 7(Sm') = (7(Sm') " 7(Sm')) + ili^) " 7(Sm')) + iliSm) " , (8) 

and control each term separately. The first term is the most delicate to handle, and 
requires the introduction and the control of a chi-square statistic. The main difficulty 
here is the non-bounded characteristic of the objects we are dealing with. Indeed, in the 
classic density estimation context such as that of [17], the objects are probabilities which 
are bounded and so facilitate the direct use of concentration inequalities. 
In our case, the chi-square statistic we introduce is denoted Xm defined by 

Xm ~ X i^m, — \J\ , (9) 



where we recall that Yj = — — and use the notation Ej = with Ej = V,^ jEf. 

\J\ l-'l ^tfcj 

Respectively for (V) and (A/'i3), we have Et = Xt and Et = 4'^~^- The purpose is thus to 

control uniformly over M.n- To this effect, we need to obtain an exponential bound 

of Yj — Ylit&j^t ^.round its expectation. In Subsection 2.1, we recall a result of [15] that 

we use to derive an exponential bound for (Subsection 2.2). 

2.1 Control of Fj 

First we recall a large deviation results estabhshed by [15] (lemma 3) that we apply in 
the Poisson and negative binomial frameworks. 
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Lemma 2.1. Let Yi, . . . ,Yn be n independent centered random variables. 
//log(E[e^'*"^]) < K— — ^ for all z e [0, 1/t[, and 1 < i < n, then 



2(1 - zr) 



<e for all x > 0. 



1=1 \ i=l 

If for 1 <i <n and all z > log(E[e-^'^*]) < kz%/2, then 



1/2 



i=l 



i=l 



<e ''for all x > 0. 



To apply this lemma we therefore need a majoration of log E [e^^^* -^'^j and logE [e ^*)] 
for z >0. 



Poisson case. 

With Et = \t, we have: 

logE [e"(^*-^*)] = -zXt + logE [e'^J] = -zXt + log e^^'^"'"^)^ = Xt{e' -z-1). 

So 

logE [e^^'^*--^*)] = Et(e' -z-1). 

Negative binomial Ccise. 

In this case Et — 6^—^ and we have 



logE e 



= -zc 



< 



< 



Pt 
Pt 



+ 01og 
-z) + 



1-{1-Pt)e^ 

i-Pt Pt 



Pt 1-(1-Pt)e 



for ^ < — log(l — Pt) 
ie' - 1) 



-) + ^(e^-l) 
Pt 



<(pL^(e' -z-1). 
Pt 



So that in both cases, 



logE [e"(^'-^*)] < Et{e' -z-1). 



Now using — 2; — 1 < „.,^_ for 2; > and — 2; — 1 < ^ for 2; < 0, we have 



2(l-z) 

logE [e"(^*-^*)] < E, 



and logE[e-(^*-^*)] <^,^ 



Then, 



Yj-Ej> y^2^j + < e-^ 



or 



P[Yj-Ej>x]<e and P [iFj - > x] < 2e 



x) 



(10) 
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2.2 Exponential bound for x 



2 

m 



We first introduce the following set defined by: 









{ 











(11) 



for all e g]0, 1[ and all segmentations m such that each segment J verifies \ J\ > r(log(n))^. 
This set has a large probability since we obtain 



< 2 J] e-l^l^'^(^'^-") < 2|m| exp(-£'r/(0, p^,„)(log(n))2) 



Jem 

by applying equation (10) with x — sEj and where e' — / (2(1 + e)) and /(0, Pmin) > 0. 
Thus 

with a > 2. 

The reason for introducing this set is double: in addition to enable the control of xL given 
by equation (9) on this restricted set, it allows us to link K{sjn, Sm) to (see (18) for 
the control of the first term in the decomposition) and so to x^, relation that we use to 
evaluate the risk of one model (see (20)). 

Let m/ be a partition of M.n such that VJ e m/, | J| > r(log(n))^ and assume that all 
considered partitions in A^„ are constructed on this grid ruf. The following proposition 



gives an exponential bound for Xm on the restricted event O^Je). 



Proposition 2.2. Let Yi, . . . ,Yn be independent random variables with distribution Q 

(Poisson or negative binomial distribution). Letm be a partition o/A1„ with \m\ segments 
and Xm the statistic given by (9). For any positive x, we have 



X^lQ^^(e) > |m| +8(l + £)/^ + 4(l + £)x 



< e' 



Proof. As in the density estimation framework, this quantity can be controlled using the 
Bernstein inequality. In our context, noting — Ylijem ■^J where 

{Yj - Ejf 
Ej ' 

we need 



• the calculation (or bounds) of the expectation of x^: 
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Poisson case 



Yj is distributed according to a Poisson distribution with parameter Aj so that 

E[X^] = H. (13) 



Negative binomial Ccise 

We have 



1 T^teJ^ 



Jem ' ' ^ Pj 



and thus 



1 



\m\ < E [x^] < \m\. 

Pmin 

an upper bound of X]jg^E[Z^]. For every p > 2 we have, 



(14) 



E 



+ 00 



JO 



2p x^P-'P [{\Yj - Ej\ >x}n 0„,(e)] dx 



1 rs^j 

< 2px''P-'P[\Yj-Ej\>x]dx. 
Using equation (10) and since x < eEj, we obtain the exponential bound P [\Yj — Ej\ > 



Therefore 



and 



E 



Z'jla^^ie)\ ^ Apx'^-'e-^-A^+^Wx 



< 



r+oo 2 
Jo 



r" + 00 



< 4p (1 + ef / {2tY-' e-'dt 

Jo 

< 2P+''p{l + eyp\, 



EE[^?ln.,(e)] < 2^+^(1 + £)>!|m|. 



Since p < 2p-\ 



< ^ X [2'{l + ef\m\] X [4(1 + ^)] 



p-2 
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We conclude by taking v = 2^ {1 + e f |m| and c = 4(1 + e) (see proposition 2.9 of 
[18] for the definition of the Bernstein's inequahty). 

□ 

3 Simulations and application 

In the context of RNA-seq experiments, an important question is the (re)-annotation of 
the genome, that is, the precise locahsation of the transcribed regions on the chromosomes. 
In an ideal situation, when considering the number of reads starting at each position, one 
would expect to observe a uniform coverage over each gene (proportional to its expression 
level), separated by regions of null signal (corresponding to non-transcribed regions of the 
genome). In practice however, those experiments tend to return very noisy signals that 
are best modelled by the negative binomial distribution. 

In this Section, we first study the performance of the proposed penalized criterion 
by comparing it with others model selection criteria on a resampling dataset (Subsection 
3.1). Then we provide an application on real data (Subsection 3.2). Since the penalty 
depends on the partition only through its size, the segmentation procedure is two-steps: 
first we estimate, for all number of segments K between 1 and K^ax, the optimal par- 
tition with K segments (i.e. construct the collection of estimators {sK}i<K<Kmax where 
Sk — argmin^^ meA^/f{7('^m)})- The optimal solution is obtained using a fast segmen- 
tation algorithm such as the Pruned Dynamic Programming Algorithm (PDPA, [20]) 
implemented for the Poisson and negative binomial losses or contrasts in the R package 
SegmentorSIsBack [21]. Then, we choose K using our penalty function which requires 
the calibration of the constant /3 that can be tuned according to the data by using the 
slope heuristic (see [7, 22]). Using the negative binomial distribution requires the knowl- 
edge of parameter (j). We propose to estimate it using a modified version of the Jonhson 
and Kotz's estimator [23]. 

3.1 Simulation study 

We have assessed the performances of the proposed method (called Penalized PDPA) on 
a simulation scenario by comparing to five other procedures both its choice in the number 
of segments and the quality of the obtained segmentation using the Rand- Index X. This 
index is defined as follows: let Ct be the true index of the segment to which base t belongs 
and let Ct be the corresponding estimated index, then 

(n-l)(n-2) ■ 

The characteristics of the different algorithms are described in Table 1. 

The data we considered comes from a resampling procedure using real RNA-seq 
data. The original data, from a study by the Sherlock Genomics laboratory at Stan- 
ford University, is publicly available on the NCBIs Sequence Read Archive (SRA, url: 
http://www.ncbi.nlm.nih.gov/sra) with the accession number SRA048710. We created 



2E: 
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Algorithm 


Dist 


Gomplexity 


Inference 


Pen 


Exact 


Reference 


Penalized PDPA 


NB 


nlogn 


frequentist 


external 


exact 


[21] 


PDPA with BIG 


NB 


nlogn 


frequentist 


external 


exact 


[21] 


Penalized PDPA 


P 


nlogn 


frequentist 


external 


exact 


[21] 


PDPA with BIG 


P 


nlogn 


frequentist 


external 


exact 


[21] 


PELT with BIG 


P 


n 


frequentist 


internal 


exact 


[24] 


GART with BIG 


P 


nlogn 


frequentist 


external 


heuristic 


[25] 


postGP with IGL 


NB 


n 


frequentist 


external 


exact 


[4] 


EBS with IGL 


NB 


n? 


B ayes i an 


external 


exact 


[26] 



Table 1: Properties of segmentation algorithms. The first column indicates the name of 
the algorithm and the criterion used for the choice of K. In the second column, NB stands 
for the negative binomial distribution and P for Poisson. The time of each algorithm is 
given (column "Complexity") and column "Exact" precises if the exact solution is reached. 

an artificial gene, inspired from the Drosophila inr-a gene, resulting in a 14-segment sig- 
nal with unregular intensities mimicking a differentially transcribed gene. 100 datasets 
are thus created. Results are presented using boxplots in Figure 3.1. Because PELT's 
estimate of K averaged around 427 segments, we did not show its corresponding boxplot. 

We can see that with the negative binomial distribution, not only do we perfectly 
recover the true number of segments, but our procedure outperforms all other approaches. 
Moreover, the impressive results in terms of Rand-Index prove that our choice of number 
of segments also leads to the almost perfect recovery of the true segmentation. However, 
the use of the Poisson loss leads to a constant underestimation of the number of segments, 
which is reflected on the Rand-Index values. This is due to the inappropriate choice of 
distribution (confirmed by the other algorithms implemented for the Poisson loss which 
perform worse than the others). It however underlines the need for the development of 
methods for the negative binomial distribution. Moreover, in terms of computational 
time, the fast algorithm [21] is in C (nlogn), allowing its use on long signals (such as a 
whole- genome analysis), even though it is not as fast as CART or PELT. 

3.2 Segmentation of RNA-Seq data 

We apply our proposed procedure for segmenting chromosome 1 of the S. Cerevisiae (yeast) 
using RNA-Seq data from the Sherlock Laboratory at Stanford University [1] and publicly 
available from the NCBI's Sequence Read Archive (SRA, urkhttp:/ /www. ncbi.nlm.nih.gov/ 
accession number SRA048710). An existing annotation is available on the Saccharomyces 
Genome Database (SGD) at urhhttp:/ /www. yeastgenome.org, which allows us to validate 
our results. The two distributions (Poisson and negative binomial) are considered here to 
show the difference. 

In the Poisson distribution case, we select 106 segments of which only 19 are related to 
the SGD annotation. Indeed, as illustrated by Figure 2, 36 of the segments have a length 
smaller than 10: the Poisson loss is note adapted to this kind of data with high variability 
and it tends to select outhers as segment. On the contrary, we select 103 segments in the 
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Estimation of K Rand Index 
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Figure 1: Estimation of K on resampled datasets. Left: boxplot of the estimation 
of K on data-sets simulated by resampling on artificial gene Inr-a. PELT's estimates 
average at 427 segments and are not shown. The pink horizontal line indicates the true 
value of K. Right: boxplot of the Rand- Index values for the proposed estimators. 

negative binomial case most of which (all but 3) surround known genes from the SGD. 
Figure 3 illustrates the result. However, almost none of those change-points correspond 
exactly to annotated boundaries. Discussion with biologists has increased our belief in the 
need for genome (re-) annotation using RNA-seq data, and in the validity of our approach. 

4 Proof of Theorem 1.3 

Recall that we want to control the three terms in the decomposition given by (8). All the 
proofs of the different propositions are given in Section 5. 

• The control the term 7(3^') — 7(sm') is obtained with the following proposition 
where the set f2i(0 is defined by 

^i(0= n {x^'lf^™.,(e) < \m'\+8{l + e)^/{L^,\m'\+0\m'\+4{l + e){L^,\m'\+0}. 
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position 



Figure 2: Segmentation of the yeast chromosome 1 using Poisson loss. Read-count are 
represented on a root-squared scale. The model selection procedure chooses K = 106 
segments. 



Proposition 4.1. Let m' be a partition of Ain- Then 



(7(sm') - l{sm')) ln„^.(e)nOi(0 < C{e) \m'\ + 8(1 + e)^/{Lm'\m'\ + 

+ 4(1 + e){L^,\m'\ + 0] + ^^^(^™'. s^'), 
with C{e) = ^ (]Tif) the Poisson case and C{e) = ^ in the negative binomial 



case. 



• The control of the term 7(5^) — l{s), or more precisely its expectation, is given by 
the following proposition: 

Proposition 4.2. 

, [(7(S™) -7(s))W{e)]| < -7^31)7^ • 
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Figure 3: Segmentation of the yeast chromosome 1 using the negative binomial loss. The 
model selection procedure chooses K = 103 segments, most of which surround genes given 
by the SGD annotation. 

• To control 7(5) —7(3^')) use the following proposition which gives an exponential 
bound for 7(5) — 7(m). 

Proposition 4.3. Let s andu he two distributions of a sequence Y. Let'-/ be the log- 
likelihood contrast, 7(m) = 'j{u) — E['~f{u)], and K{s,u) and h'^{s,u) be respectively 
the Kullback-Leibler and the squared Hellinger distances between distributions s and 
u. Then Va; > 0, 

^[l{s)-^{u)>K{s,u)-2h^{s,u)+2x\ < e"^. 

Applying it to m = Sm' yields: 

P['y{s)-^{s^,)>K{s,s^,)-2h\s,s^,) + 2x] < e"". (16) 
We then define Q2{0 = Om'eMn {^(*) " ^i^rn') < K{s, s^>) - 2h^{s, s^>) + 2{L^, \m'\ + C)}. 

Let VL{e,S,) = Qmfi^) ^^2{0- Then, combining equation (16) and proposition 

4.1, we get for m' = rh, 
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(7(sm) - 7(sm))ln(6,o = (lis) - 7(s^))1q(,,^) + (7(5^) - 7(s))ln(,,^) + (7(5^) - 7(s^))ln(,,^) 



< [K{ 



+C{e) \m\ + 8(1 + e)y/{Lrh\rh\ + ^)\m\ + 4(1 + £)(L^|m| + 
+2L^|m|+2e, 

with it! = 7(5,71) — 7(5). So that 

+C(e) |m| +8(1 +£)v/(Z^^^^^^T0W + 4(l + £)(i:A|^Ti| +0 
+K{s,Sm)ln{e,o + '^Ljh\m\ + 2^ + Rla(^e,o - P(^n{m) +pen{m). 

And since 

• K{s, Srn) = K{s, Srh) + K{srh, Sm) (§66 equation (17)), 

• K{s,u) > 2h^{s,u) (see lemma 7.23 in [18]), 

• h^{s,Srh) < 2{h^{s,Srh) + h'^{srh,Sm)) (using inequahty 2ab < ko^ + K~^b^ with 



— — /i^(s, Srn)lf2(e,o < K{s, Sm)la{e,o + - pen{m) + pen{m) 

J- I £ 



+|m|C(£) 1 + (1 + £) 8 + e + AL 



+2^ 



+ 2Lrh\rh\ 



l + C{e) (8(l + £)-+4(l + £; 



But 



C(£) [l + (l + £) (8Vl^ + £ + 4L^)] +2L^ < C(£) [l + (l + £) (£ + 8\/l;;;+8L^) 



1 + 8VL^ + 8L 



< C2{e) 



with C2(£) = J f for (P) and C2(£) = 7(1 + ef for (AT-B). So we have 

2 y 1 — £/ 4 

^--p^/i^(s, s^)ln(e,5) < i^(s, s^)lQ(e,o + ^lo(e,^) " pen(m) + pen{m) 



+\m\C2{e) (l + 4v^)' + 2e 



1 + (1+£)C(£) - + 2 



By assumption, pen{m) > f3\m\ (l + 4-\/L^)^. Choosing /? = C2{e) yields 
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Then, using propositions 4.2 and 4.1, we have P (ni(^)'^) < ^^/g^vi e"^'"'!"*'!''"^ and 
P (^^2(0'') < E„.'eA4„ e-^"^'l-'l+«. So that using hypothesis (3), 

m'eMn 

and thus P l~l ^2(0) > 1 ~ 2Ee~^. We now integrate over ^, and using equation 

(15), we get with a probabihty larger than 1 — 2Ee~'^ 



E 



K{s,Sm) H T—rrj^ ^ pen{m) 



And since E 



< 



^(a-l)/2 
^(0) Pmini Pmaxi Pi 



+ EC(^). 



a-l 



, we have 



Pmim Pmax-i 

Finally, by minimizing over m e Ain, we get 

Pmin 1 Pmax i 

meMn 



5 Appendices 

5.1 Proof of proposition 1.4 

Using Pythagore-type identity, we obtain the following decomposition (see for example 
[17]): 

K{S, Sm) = K{s, Sm) + K{s 

)■ (17) 

The objective is then to obtain a lower bound of 'Ei[K{sm-, Sm)] in the two considered 
distribution cases. 

Poisson caise 

We have 

where $(x) — — 1 — x. Since ^x^(l A e^) < $(x) < \x^(l V e^), then on (e), we 

2 2 

have 
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So 



where 



1 log X 1 

And using, for x > 0, 7773 < - — - < ^ ^ we get, on ^^^.(e) 



1 Vx ~ X- 1 ~ 1 Ax 

2 /\.m — m — 



2 -^m' 



So 



1 -£ 



1+6 



On one hand, E [x^] = |m|, and 



1 -e 
2(1 + ^) 
Since < 



Iml -E 



1 



< E 



^m; ^m) Ifin 



< 



2(1 -e) 



(18) 



(19) 



(20) 



m — r(log(n))2pmin ^^em 

Schwarz Inequahty, we get 



1 



r(log(n))2pr, 



(Et - Et ^tf^ using Cauchy- 



E 



< 



1 



r(log(n))2p. 

— C'(r, Pmini Prr 



1 1/2 



3 E^. +E-^. 



P(Ji„,(£)'^)'/^ 



n 



(log(n))2 ^ ™^ 

^(0) r, Pmini Pmaxi ^i ^) 



P((]mJ6)^)V2 



< 



a/2-a 



where a = 1 - 2^^^^^^, n > 2. For example, a = 0.62 for n = 10^ 



On the other hand, using logl/x > 1 — x for all x > 0, E ir(sm, Sm)ln„^(e)c 
Finally, we have 



> 0. 



1 ^11 ^liy 1 Pmini Pmaxi ^1^) , Tnrr^/' ^ M 

\m\ < E[A:(s,Sm)J, 



2(l+£)2' 



a/2-a 
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Negative binomial case 

I J| 1 — X 

'WehecveK{sm,Sm) = ^ZljFm — (Pj) andVO < a < 1, ha{x) > 

Pj 't'+^j 1 — a 

Then on Q-rnf (e) 



lo£ 



Introducing 



we get 



1^1 ^ "^log2 I ^ 



Jem ^ 4>+Yj 



Y, 



Pj 



1-Pj 



K{Sm,Sm) > K 



2 



and since Yj - = ^ - (1 - p^)) , we have 







+ lj 







log 



1 -Pj 



0+yj 

1-Pj 



- 1 



And finally, 



,2 (l-^)\,2 



Moreover, on one hand we have \m\ < E [y^I < Iml. On the other hand, 

Pmin 

- r(iog(n))4(i-P^a.) (5^*^* ~ E^^^)^ using Cauchy-Schwarz Inequality, we get 



E 



!5 ^ — 



1/2 



r(log(n))2(^(l 

Prnax ) 

r, Pmini Pmaxi ^i 



< 



a/2-a 



where a = 1 - 2 ^°^,^^°ff\ n > 2. Finally, we have 



ml < E[K(s,s^)J. 



a/2-a 
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5.2 Proof of proposition 4.1 
Poisson case 

Y 

The term to be controlled is ^{sm') — ^{sm') = I^jem' l^^l {^J ~ '^■^) ^^S Using Cauchy- 

A.r 



Schwarz inequality, we have 



with Xm' ^■nd V^, defined as in equations (9) and (19). Then, using equation (18) 



(7(Sm') -7(5m'))ln^^(6) < ^:^^K{Sm',Sm'), 

and using 2ab < ko^ + for all k > 0, we get 



K 



(7(Sm') -7(5m'))lf2„^(6) < 2^m' + ^;3^^(Sm',S^')- 

l+£ 



And with proposition 2.2, we get, for k 

(7(^m') - 7(sm')) ln^,(c)nni(0 



1 -£ 



2C(£), 



< 



l + £ 



m 



2(1 -e) 
Negative binomial Ccise 



'I +8(l + £)V(^m'|m'| +4(l + £)(L^/|m'| +0 + T^i^(s„', 



(23) 



l + £ 



In this case we can write ^{sm') - l{sm') = Sjem' l>^l (^^ ~ ^j) ^og -^±^. Again, using 

i — Pj 

Cauchy-Schwarz inequality, and with and defined by equations (9) and (21), we 
get 



SO that with equation (22) and 2ab < ko^ + for all k > 

(7(Sm') -7(Sm'))ln^^(6) < -^Xm' + ^K{Sm',Sr. 



(24) 



Finally, with proposition 2.2 and k, 
{l{sm>) - 7(s^')) ln^.(e)nni(0 



1 + e 



2C(£), 



< 



l+£ 



Im'l + 8(1 + e)JiLm'\m'\ + nim'l + 4(1 + e)iLm'\m'\ + + Ki 

J 1 + £ 
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5.3 Proof of proposition 4.2 
Poisson case 

Noting that £[(7(5^) - 7(s))ln^^(6)] = -E[(7(sm) - 7(s))ln^^(,)c], we have 
|E[(7(sr.)-7(5))ln.,(e)]| < |E[(7(^^)-7(5))la.^(.)c]| <E[|(7(s„)-7(s))|la^^(,)c] 



< E 



n 



J t 

< log (pmax/Pmin) X E 

/r 



< log (Pmax/Pmin) X 



1 1/2 



X (P(Q„,(e)^) 



< {nprnaxf^"^ X log (pmax/Pmin) X (P(fimj- (e)*^) 

which concludes the proof. 
Negative binomial case 

Once again, E[(7(s^) - 7(s))ln„.^(6)] = -£[(7(5^) - 7(s))ln^^(6)c], and 
|E[(7(^J-7(^))1^7.,(e)]| < |E[(7(s™)-7(^))ln„,(.)c^]| <E[|(7(^J-7(^))|ln„,(e)c] 



< E 



J t 

< log (1/(1 - xE 

1/2 



.1 -Pt 



log (1/(1 -Pmm)) 



< 



/ 1 \ 1 

X log — X (P(l]^,(e)^)^/^ 

V PminJ Pmin 



which concludes the proof. 
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5.4 Proof of proposition 4.3 

Using the Markov inequality P [7(5) - 'y{u) > b] < inf„ [e'^^'E (e«(^(^)-^(")))] with a = |, 
we get 

< exp 



< exp 



< exp 

< exp 

where = P denote the probabihty under the distribution s. Thus 

P [7(5) - 7(u) > K{s, u) - 2h'^{s, u) + 2x] < e"^. 

The authors wish to thank Stephane Robin for more than helpful discussions on the statistical 
aspect and Gavin Sherlock for his insight on the biological applications. 
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